EACL 2009 Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages
Authors
Abstract
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data.
Similar resources
EACL 2009 Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference
Preface. We are delighted to present you with this volume containing the papers accepted for presentation at the ... We want to acknowledge the help of the PASCAL 2 network of excellence. Thanks also to Damir Ćavar for giving an invited talk and to the programme committee for the reviewing and advising. We are indebted to the general chair of EACL 2009, Alex Lascarides, to the publication chairs,...
Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, Endorsed by the Association for Computational Linguistics
We propose an approach to corpus-based semantics, inspired by cognitive science, in which different semantic tasks are tackled using the same underlying repository of distributional information, collected once and for all from the source corpus. Task-specific semantic spaces are then built on demand from the repository. A straightforward implementation of our proposal achieves state-of-the-art ...
Language ID in the Context of Harvesting Language Data off the Web
As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried and true methods that are used. One of the NLP techniques that has been treated as “solved” is language identification (language ID) of written text. However, we argue that languag...
Revisiting Multi-Tape Automata for Semitic Morphological Analysis and Generation
Various methods have been devised to produce morphological analyzers and generators for Semitic languages, ranging from methods based on widely used finite-state technologies to very specific solutions designed for a specific language or problem. Since the earliest proposals of how to adopt the elsewhere successful finite-state methods to root-and-pattern morphologies, the solution of encoding Se...
The Universität Karlsruhe Translation System for the EACL-WMT 2009
In this paper we describe the statistical machine translation system of the Universität Karlsruhe developed for the translation task of the Fourth Workshop on Statistical Machine Translation. The state-of-the-art phrase-based SMT system is augmented with alternative word reordering and alignment mechanisms as well as optional phrase table modifications. We participate in the constrained conditio...